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ABSTRACT 

New  approaches  like  perturbation  analysis  and  the  likelihood  ratio  method  have  been 
proposed  recently  to  estimate  the  gradient  of  a  performance  measure  with  respect  to  some 
continuous  parameters  in  a  dynamic  stochastic  system.  In  this  paper,  we  experiment 
the  use  of  these  estimators  in  stochastic  approximation  algorithms,  to  perform  so-called 
^single- run  optimizations^  We  also  compare  them  to  finite  difference  estimators,  with  and 
without  common  random  numbers.  The  experiments  are  done  on  a  simple  M/M/1  queue. 
The  performance  measure  involves  the  average  system  time  per  customer,  and  the  optimal 
solution  is  easy  to  compute  analytically,  which  facilitates  the  evaluation  of  the  algorithms. 
We  also  demonstrate  some  properties  of  the  algorithms.  In  particular,  we  show  that  using 
perturbation  analysis,  the  single-run  optimization  converges  to  the  optimum  even  with  a 
fixed  (and  small)  number  of  ends  of  service  per  iteration,  while  under  the  same  conditions, 
the  algorithm  that  uses  the  finite  difference  estimators  converges  to  the  wrong  answer. 
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1.  Introduction 


Simulation  has  traditionally  been  used  to  evaluate  the  performance  of  complex  systems, 
especially  when  analytic  formulas  are  not  available,  but  rarely  to  perform  optimization. 
Consider  a  (stochastic)  simulation  model  parametrized  by  a  vector  8  of  continuoxis  parame¬ 
ters,  and  suppose  one  seeks  to  minimize  the  expected  value  a(0)  of  some  objective  function. 
In  principle,  if  a{6)  is  well  behaved,  one  could  estimate  its  gradient  by  simulation,  and  use 
adapted  versions  of  classical  non-linear  programming  algorithms.  One  way  to  estimate  the 
gradient  is  to  use  finite  differences.  However,  in  most  practical  applications,  obtaining  a 
reasonably  accurate  estimate  often  require  unacceptably  high  amounts  of  computer  time. 

Recently,  new  ways  have  been  proposed  to  estimate  the  gradient  of  a  performance 
measure  (defined  as  a  mathematical  expectation),  with  respect  to  continuous  parameters, 
by  simulation  [3,  4,  5,  6,  17,  18,  21,  23].  For  ‘^steady-state”  simulations,  a  “single-run” 
optimization  scheme  has  also  been  suggested  [14,  22].  Combined  with  appropriate  variance 
reduction  techniques,  these  methods  cotild  enlarge  substantially  the  class  of  stochastic 
optimization  problems  that  can  be  solved.  However,  a  good  amount  of  research  remains 
to  be  done  on  the  theoretical  properties  of  these  methods  (some  of  them  were  proposed  as 
“heuristics”),  and  empirical  investigations  should  be  done. 

In  this  paper,  we  report  the  results  of  numerical  experiments  performed  on  simple 
queueing  systems.  The  idea  was  to  run  the  algorithms  (and  their  variants)  on  a  problem 
for  which  we  could  compute  analytically  the  optimal  solution.  Such  an  experiment  was  first 
undertaken  in  [22],  but  these  authors  looked  at  only  two  methods,  which  they  presented  as 
heuristics.  One  was  based  on  perturbation  analysis  (PA)  and  the  other  was  an  adaptation 
of  the  Kiefer- Wolfowitz  (KW)  algorithm.  They  observed  empirically  that  for  the  problem 
considered,  the  former  method  (PA)  was  converging  much  faster  than  the  latter  (KW).  We 
prove  in  this  paper  that  for  the  example  they  examined,  their  first  method  converges  to  the 
optimal  solution,  while  their  second  might  converge  to  the  wrong  answer.  We  then  suggest 
new  variants  of  KW  that  converge  to  the  optimal  solution.  For  the  (simple)  examples 
considered,  one  of  these  variants  (which  uses  common  random  numbers  and  increases  the 
simulation  length  at  each  iteration)  appears  to  be  competitive  with  PA.  We  also  experiment 
with  different  variants  of  the  likelihood  ratio  (LR)  method  [3,  4,  17,  19]  (also  called  the 
score  function  (SF)  method).  Our  numerical  experiments  deal  with  examples  where  the 
decision  parameter  vector  8  has  only  one  component.  The  multidimensional  case  certainly 
involves  more  intensive  computations.  As  always,  since  these  experiments  were  done  on 
specific  numerical  examples,  one  should  be  careful  in  making  any  generalizations. 


1 


In  section  2,  we  briefly  describe  a  stochastic  approximation  algorithm  for  steady-state 
simulation  optimization  in  a  general  setting.  More  details  are  given  in  the  appendix,  where 
we  recall  known  convergence  results,  with  slight  adaptations.  Our  aim  is  not  to  give  the 
most  general  results,  but  as  stated,  the  theorems  are  general  enough  for  a  fairly  large 
class  of  applications.  At  each  iteration,  the  algorithm  requires  an  estimate  of  the  gradient 
Va(‘)  at  a  given  point,  and  section  3  reviews  some  techniques  to  obtain  that  estimate.  Our 
presentation  is  made  in  the  steady-state  setting,  but  it  also  applies  (with  some  simplifica¬ 
tion)  to  terminating  simulations,  or  to  the  case  of  infinite  horizon  total  discounted  cost, 
by  taking  a{6)  as  the  total  expected  (possibly  discounted)  cost  at  parameter  level  9.  The 
initial  simulation  state  is  then  fixed  (or  could  be  part  of  the  parameter,  or  random  with 
known  distribution).  For  terminating  simulations,  many  things  simplify  since  the  initial 
bias  problem  disappears.  In  section  4,  we  consider  a  simple  example,  similar  to  the  one 
studied  in  [22]:  one  aims  to  minimize  a  function  of  the  average  system  time  per  customer, 
in  a  M/M/1  queue  for  which  the  decision  variable  is  the  parameter  of  the  service  time 
distribution.  We  give  the  results  of  an  extensive  numerical  investigation,  and  prove  some 
properties.  In  the  conclusion,  we  comment  on  the  possible  behavior  of  other  variants  of 
this  example,  and  mention  prospects  for  further  research. 
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2.  A  stochastic  approximation  scheme 

2.1.  The  general  form  of  the  algorithm 

Consider  a  discrete  event  simulation  whose  evolution  law  depends  on  a  parameter  vector  B, 
where  9  6  a  compact  convex  subset  of  II*^.  Let  a(0)  be  the  steady-state  cost  of  running 
the  system  at  parameter  level  B,  and  suppose  one  is  interested  in  minimizing  a(B)  over  @. 
We  assume  that  a(*)  is  continuously  differentiable. 

We  consider  a  stochastic  approximation  algorithm  of  the  form: 

Bn+i  :=  ire(B„  -  'r^Yn)-  (1) 

where  Bq  &  0  is  fixed  (or  random  with  known  distribution).  Here,  for  n  >  0,  is  the 
parameter  value  at  the  beginning  of  iteration  n,  Yn  is  sm  estimate  of  the  gradient  Va(0) 
obtained  at  iteration  n,  {7nin  >  0}  is  a  positive  sequence  decreasing  to  0  and  such  that 

denotes  the  projection  on  the  set  0  (i.e.  ire(d)  is  the  closest  point 
to  B  in  6).  Note  that  this  can  also  be  generalized  (and  the  results  will  still  hold)  to  the 
case  where  a  different  subsequence  {7n»n  ^  0}  is  chosen  for  each  component  of  B. 

Let  s„  denote  the  state  of  the  simulation  at  the  beginning  of  iteration  n  (description 
of  all  the  objects  in  the  system,  event  list,  etc.).  We  assume  that  enough  information  is 
kept  in  s„  so  that  the  pair  (^m^n)  evolves  as  a  (possibly  non-homogeneous)  Markov  chain. 
Let  So  be  the  initial  state.  There  are  different  ways  of  obtaining  and  since  it  must 
be  obtained  in  finite  time,  the  estimator  will  usually  be  (sometimes  only  slightly)  biased. 
The  computation  of  Yn  involves  say  Kr  >  1  simulation  subruns,  where  for  k  = 
the  k-th  subrun  is  performed  at  parameter  value  Bnk  =  ^nJb(^n)>  form  initial  state  s„jt,  and 
for  duration  Tnk  (possibly  random).  That  duration  can  be  expressed  for  instance  in  terms 
of  the  simulation  clock  (units  of  simulation  time),  or  as  a  number  of  regenerative  cycles, 
or  as  a  number  of  customers  to  be  served  (in  a  queueing  system),  etc.  The  functions  ^„it, 
the  rules  for  selecting  Snfc,  and  the  probability  distribution  (or  value)  of  Tnk  &11  depend  on 
the  algorithm.  We  assxime  that  there  is  also  a  rule  for  choosing  s„.).i  at  the  end  of  these 
Kr  simulation  subruns  (it  could  be  for  instance  the  final  state  of  one  of  the  Kr  subruns). 
Often,  Kr  =  1,  Sni  =  s„,  Bnk  =  and  Sn.f.i  is  the  state  of  the  simiilation  model  at  the  end 
of  iteration  n.  For  a  symmetric  finite  difference  scheme,  one  has  Kr  =  2d  and  each  Bnk  is 
Bn  plus  a  constant  times  a  unit  vector. 

Denote  by  En(’)  the  conditional  expectation  £[•  |  We  can  write  Yn  as 

y;.=Va(«„)+^n  +  £n  (2) 


3 


where  i^n[£n]  =  0*  so  that  /9„  represents  the  (conditional)  bias  on  ¥„  given  (^n>s„). 

2.2.  About  the  convergence 

Sufficient  conditions  for  the  convergence  of  (1)  to  an  optimum  (or  of  similar  algorithms, 
often  without  the  projection  operator)  have  appeared  in  many  places  (see  [9,  10,  11,  15, 
16,  18]  and  other  references  cited  there).  Some  sets  of  conditions  imply  almost  sure  con¬ 
vergence,  others  imply  only  some  form  of  weak  convergence,  but  are  often  easier  to  verify. 
In  the  appendix,  we  pve  two  sets  of  conditions  that  are  slight  adaptations  from  [9,  11]. 
One  set  is  for  convergence  with  probability  one,  the  other  one  is  for  weak  convergence. 

Roughly  speaking,  the  algorithm  should  converge  almost  siirely  if  7^  to  zero  fast 
enough  to  damp  out  in  some  way  the  variation  due  to  the  e„’s,  and  that  convergence  should 
be  to  a  zero  of  the  gradient  Va(d)  if  /?„  goes  to  zero. 
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3.  Ways  of  estimating  the  gradient 


One  crucial  ingredient  for  the  algorithm  sketched  in  the  previous  section  is  im  efficient 
gradient  estimation  technique.  In  this  section,  we  survey  some  possibilities. 

3.1.  Kiefer-Wolfowitz  (KW) 

This  method  is  described  for  instance  in  [9,  18],  without  the  projection  operator.  Here, 
y„  is  a  finite  difference  estimator.  Take  a  positive  sequence  {c„,n  >  0}  that  converges  to 
0.  Let  e,  denote  the  d-dimensional  unit  vector  with  a  1  in  position  t.  The  estimation  of 
Yn  involves  Kr  =  2d  simulation  subruns,  each  subrun  yielding  an  estimation  W~^  or 
of  the  “average”  cost  at  a  given  parameter  setting.  More  precisely,  for  i  =  l,,..,d,  we 
simulate  from  some  initial  state  s~  at  parameter  value  ire(6n  —  c„c,)  for  duration  r„  to 
obtain  IV",  and  we  simulate  from  state  parameter  value  ireC^n  +  c„Ci)  for  duration 

Tn  to  obtain  W^-.  We  then  compute 

(3) 

t=l 

where 

A„,  =  |l7re(^„  -I-  CnCi)  -  7re{^n  -  c„e,).||  (4) 

Here,  Tnk  =  Tn  for  A:  =  1, . . . ,  2d.  Again,  Tn  can  be  a  number  of  (simulated)  time  units,  or 
a  number  of  regenerative  cycles,  or  a  number  of  observations  of  some  sort,  etc.  Usually, 
W~-  and  W^-  are  averages  over  these  “pieces”  of  simulation  of  “duration”  r„.  We  can 
also  generalize  this  scheme  by  taking  different  subsequences  {c„,n  >  0}  for  the  different 
components  of  0.  Note  that  problems  might  occur  at  the  boundary  of  0:  A„i  can  be  much 
smaller  than  2c„  or  may  even  be  zero  in  bad  cases.  But  we  will  not  discuss  this  problem 
any  further  here.  If  6  is  an  interval  (or  more  generally  an  hyperrectangle),  as  will  be  the 
case  for  the  examples  treated  here,  it  is  easy  to  see  that  A„i  >  c„. 

There  are  two  sources  of  (conditional)  bias  in  One  is  due  to  the  fact  that  we  use 
finite  differences.  Call  it  0^.  It  goes  to  0  as  n  — »  oo  since  c„  —*  0.  The  second  one  is  due 
to  the  initial  state  at  the  beginning  of  each  iteration.  We  have 

-  C„e0)  +  (5) 

EnllV+l  =  +  tn.,))  +  Ki  (6) 
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where  ^3“^  and  are  the  respective  (conditional)  bias  on  the  estimate  average  costs  due 
to  the  initial  states  and  {E„  is  defined  in  section  2.1).  This  second  (conditional) 
bias  is  then 

=  (V 

i=l 

To  satisfy  assumption  S2  of  the  appendix,  we  need  that  -h  —*  0  as  n  oo.  In 

most  practical  situations,  the  bias  and  decrease  with  T„  at  a  rate  of  approximately 
l/r„.  In  that  case,  we  will  have  ►  0  if  A„,  >  KcCn  for  some  constant  Kc  >  0  and  if 

l/(TnC„)  0.  (8) 

As  c„  decreases  to  zero,  the  variance  on  Vn  usually  increases  to  infinity.  However,  if  we 
sissume  that  the  variance  of  the  sum  of  the  numerators  in  (3)  is  bounded,  assumption  S4 
of  the  appendix  can  be  satisfied  by  choosing  the  sequences  in  such  a  way  that 

£(7n/c„)^  <  OO.  (9) 

n:sO 

A  reasonable  choice  might  be  to  take  for  instance  T„  =  ia  -h  ti,n,  y„  =  7o/(n  +  1)  and 
c„  =  con~*/^  for  appropriate  constants  ia,  h,  7o  and  cq. 

Note  that  this  method  may  work  even  if  a(*)  is  not  differentiable.  For  the  unconstrained 
case  (0  =  R'^),  if  «(•)  is  twice  continuously  differentiable,  if  satisfies  S5  of  the  appendix, 
if  we  neglect  the  bias  /3^,  under  some  additional  conditions  on  the  variance  of  the  cost 
estimators,  and  if  the  sequences  are  chosen  as  suggested  above  with  t/,  =  0,  then  the 
algorithm  converges  almost  surely  to  and  the  convergence  rate  is  of  the  order  of 
(see  Theorem  2.3.5  and  chapter  VII  in  [9j).  However,  is  zero  only  for  terminating 
simulations  or  when  we  can  exploit  the  regenerative  structure.  Otherwise,  taking  tb  =  0 
may  lead  to  disaster,  as  will  be  illustrated  in  the  next  section. 

One  simple  way  to  choose  the  initial  states  of  the  subnins  is  as  follows.  Start  the  first 
subrun  from  state  s„,  then  take  the  terminal  state  of  any  given  subrun  as  the  initial  state 
of  the  next  one.  The  terminal  state  of  the  last  subrun  will  become  the  initial  state 
for  the  next  iteration.  But  still,  we  can  permute  the  2d  subruns  of  a  given  iteration  in  any 
given  way,  and  select  the  terminal  state  of  any  subrun  for  Sn+i*  It  is  not  clear  what  is  the 
best  way  of  doing  this,  if  any.  In  any  case,  the  KW  method  is  usually  plagued  by  a  huge 
variance  on  the  Yn,  which  makes  it  converge  very  slowly,  at  least  when  the  subruns  are 
performed  with  “independent”  random  numbers. 
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Glynn  [3]  describes  an  alternative  KW  approach  based  on  regenerative  analysis.  It 
eliminates  the  biais  but  the  variance  is  usually  much  higher. 

The  package  SAMOPT  [1]  is  an  implementation  of  the  stochastic  optimisation  algo¬ 
rithm  with  KW,  with  specially  tuned  parameters.  It  was  designed  for  terminating  simula¬ 
tions.  It  also  replaces  Yn  by  its  sign. 

3.2.  Kiefer-Wolfowitz  with  common  random  numbers  (KWC) 

One  way  to  reduce  the  variance  in  KW  is  to  use  common  random  numbers  across  the 
subruns  at  each  iteration,  start  all  the  subruns  from  the  same  state,  and  synchronize.  Of 
course,  there  is  no  guaranteed  variance  reduction,  but  since  the  subruns  are  aimed  at  com¬ 
paring  very  similar  systems,  especially  when  c„  is  small,  considerable  variance  reductions 
might  be  obtained.  The  starting  state  s„+i  for  the  next  iteration  can  be  anyone  of  the  2d 
terminal  states.  A  heuristic  rule  is  to  choose  the  state  that  was  obtained  from  the  subrun 
with  the  parameter  value  the  closest  to  the  new  parameter  value  d„+i. 

Implementing  this  method  for  complex  simulations  is  not  without  pain.  Saving  the 
simulation  state  means  saving  the  states  of  the  random  number  generators,  the  event  list, 
the  values  of  all  variables  related  to  the  model  (but  not  those  related  to  the  top  level 
algorithm),  all  the  objects  in  the  model,  etc.  In  practice,  many  objects  in  the  model  are 
pointers  to  data  structures  that  can  be  created,  modified  or  destroyed  dynamically,  and 
whose  types  have  been  defined  by  the  programmer.  When  saving  the  state  of  the  system, 
one  cannot  onlv  save  the  pointer  values,  but  must  make  an  explicit  “backup”  copy  of  all 
these  structures.  When  restoring  the  system  to  a  given  state,  these  must  be  recopied  again. 
Usually,  the  simulation  package  cannot  do  that  and  specific  code  must  be  written.  In  fact, 
it  would  be  very  difficult  to  implement  “state  saving”  facilities  in  a  general  simulation 
package,  because  usually,  the  package  has  no  way  to  know  with  certainty  the  structures 
of  all  the  dynamic  objects  created  by  the  user.  All  this  implies  overhead  not  only  for 
the  computer,  but  also  for  the  programmer.  Another  source  of  programming  overhead  in 
KWC  comes  from  the  need  to  insure  synchronization  of  the  random  numbers  across  the 
subruns. 

When  c„  is  small,  there  is  sometimes  little  change  between  the  sample  paths  of  the 
2d  subruns.  One  could  then  ask:  is  it  possible  to  perform  only  one  subrun  and  trace 
the  few  changes  ?  It  is  indeed  sometimes  possible,  and  this  idea  leads  to  what  is  called 
finite  perturbation  analysis.  Tsdung  that  to  the  limit  when  Cn  goes  to  zero,  one  obtains 
infinitesimal  perturbation  analysis. 
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3.3.  Perturbation  analysis  (PA) 


In  this  paper,  PA  always  refers  to  infinitesimal  perturbation  analysis  [5,  6,  21,  23].  The 
basic  idea  is  to  generate  a  sample  path  u>,  viewed  as  a  sequence  of  1^(0, 1)  variates,  and 
for  u;  fixed,  observe  the  effect  of  an  infinitesimal  perturbation  on  B  by  propagating  it  over 
the  sample  path.  Such  a  propagation  can  be  done  rather  easily  if  one  assumes  that  the 
perturbation  on  B  does  not  change  the  sequence  of  events,  but  only  makes  them  “slide 
smoothly”  in  time.  The  gradient  estimate  is  then  taken  as  the  gradient  of  the  sample 
objective  function  for  that  fixed  value  of  w,  say  Vsh(tf,u>).  Unfortunately,  it  is  not  always 
true  that 

E[V8h{B,^)]  =  V,£?[h(tf,a;)],  (10) 

since  we  cannot  always  permute  the  derivative  and  the  expectation.  Heidelberger  et  al. 
[6]  give  conditions  under  which  PA  works  correctly.  If  PA  does  not  work  for  the  original 
system,  various  devices  can  sometimes  be  used  to  “smooth  out”  or  transform  the  original 
problem  into  a  problem  for  which  PA  will  work  correctly  (see  [5]  for  instance).  These 
devices  are  usually  problem- dependent.  In  principle,  when  (10)  holds,  PA  can  be  viewed 
as  a  limiting  version  of  KWC  as  each  c„  becomes  infinitesimal.  However,  as  we  will  see 
in  section  4,  implementation  “details”  can  often  make  a  big  difference  between  PA  and 
“infinitesimal”  KWC.  One  big  advantage  of  PA  when  it  works  is  that  it  requires  only  one 
simiilation  subrun  per  iteration,  compared  to  2d  for  KW  or  KWC. 

3.4.  Likelihood  ratio  (LR) 

The  likelihood  ratio  graidient  estimation  method  has  been  introduced  recently  by  Glynn  [3, 
4],  Reiman  and  Weiss  [17],  and  Rubinstein  [19,  20]  (who  calls  it  the  score  function  method). 
The  basic  idea  is  that  a{B)  can  usually  be  viewed  as  the  expectation  of  some  function  of 
B  and  of  the  sample  path  u;,  say  h{B,u>),  with  respect  to  some  probability  measure  P${-) 
over  some  sample  space  il.  Here,  uf  represents  all  the  randonmess  that  drives  the  system, 
so  that  when  cj  is  fixed,  everything  becomes  deterministic.  Typically,  u  can  be  viewed 
as  a  sequence  of  random  variables,  some  of  which  have  a  distribution  that  depends  on 
B.  Usually,  one  cannot  differentiate  this  expectation  directly  by  differentiating  inside  the 
integral,  because  Pe{')  depends  on  By  but  for  some  Bn  €  0,  one  can  rewrite 

a(*)  =  fk(e,w)P,(dw)  =  f  (11) 
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and  differentiate  a(’)  by  differentiating  the  bracketted  term  with  respect  to  6  inside  the 
integral  (0„  is  viewed  as  a  constant).  This  expression  can  then  be  evaluated  at  0  to 
obtain 

Va((»„)  =  j  ^  ^  (12) 

The  bracketted  expression  in  (12)  can  be  used  to  estimate  Va(9„).  As  in  PA,  only  one 
simulation  subrun  is  required  to  estimate  the  gradient.  (In  principle,  the  expression  can 
also  be  evaluated  at  any  value  of  ^  ^  yielding  an  estimate  of  the  gradient  everywhere 
in  0,  obtained  by  a  single  simulation.  The  method  can  also  be  generalized  to  higher  order 
derivatives.) 

Note  that  the  above  reasoning  holds  only  if  the  derivative  and  expectation  can  be 
interchanged  (this  can  be  done  under  some  regularity  conditions,  see  [17,  20]),  and  if  the 
likelihood  ratio  (or  Radon-Nikodym  derivative)  exists  in  a  neighborhood  of  0„,  i.e.  if  the 
sets  of  positive  P9(-)-measure  are  the  same  for  all-in  that  neighborhood.  These  conditions 
limit  the  method.  For  instance,  threshold-type  parameters  that  influence  deterministically 
the  occurence  (or  occurence  time)  of  some  events  are  ruled  out.  Another  case  where  it 
won’t  work  is  when  a  component  of  0  represents  a  transition  probability  of  some  sort, 
and  that  probability  goes  to  0  or  1.  This  could  happen  easily  when  optimizing  transition 
probabilities.  Also,  the  regularity  conditions  permitting  one  to  interchange  the  derivative 
and  expectation  are  often  satisfied,  but  not  always. 

For  steady-state  simulations,  there  is  again  the  bias  problem,  since  only  a  finite  portion 
of  the  sample  path  can  be  simulated.  A  simple  remedy  is  to  use  the  same  idea  as  in  KW: 
increase  the  simulation  length  at  each  iteration.  However,  the  variance  of  Yn  here  usually 
increases  linearly  with  Tni  so  Tn  should  not  be  increased  too  rapidly.  Assuming  that  the 
variance  is  proportional  to  Tn,  one  can  take  7„  =  7o/n  and  Tn  =  ta  +  U^’’  for  0  <  p  <  1,  and 
the  conditions  SI  and  S4  of  the  appendix  will  be  satisfied  by  taking  Another 

choice  could  be  Tn  =  ta  +  t6bm-  In  nny  case,  the  fact  that  the  variance  increases  with  the 
simulation  length  is  an  important  limitation  of  LR  in  general.  We  should  expect  LR  to 
work  much  better  for  terminating  simulations  for  which  only  a  small  number  of  random 
variates  are  generated  using  a  given  parameter. 

Other  variants  of  the  LR  approach  circumvent  the  bias  problem  by  using  a  regenerative 
approach  [3,  4,  17|.  If  the  system  possesses  a  readily  identifiable  regenerative  structure, 
a(-)  can  be  written  as  the  quotient  of  two  functions,  and  a  likelihood  ratio  approach  can  be 
used  to  obtain  an  estimator  of  the  derivative  of  the  quotient,  for  each  component  of  0.  See 
[3,  4,  17]  for  more  details.  The  methods  proposed  in  [4,  17]  still  have  some  bias,  since  they 
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estimate  the  expectation  of  a  ratio,  but  the  variance  does  not  increase  with  the  simulation 
length.  It  decreases  linearly  instead.  Also,  the  bias  goes  to  zero.  Here,  T„  can  be  taken 
as  the  number  of  regenerative  cycles  at  iteration  n.  On  the  other  hand,  algorithm  B  in  [3] 
proposes  an  unbiased  estimator.  However,  its  variance  is  usually  very  high.  According  to 
our  experience,  its  practical  applicability  appears  to  be  somewhat  limited. 
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4.  Example:  a  M/M/1  queue 

4.1.  The  basic  model  and  experimental  setup 

This  example  is  strongly  inspired  from  Suri  and  Leung  [22].  Consider  a  M/M/1  queue 
with  arrival  rate  A  =  1  sind  mean  service  time  0  €  0  =  [a, &]«  where  0  <  a  <  b  <  1. 
Here,  9  is  the  parameter  to  be  optimized.  Let  Ci  >  0  be  a  constant,  and  let  w(^)  be  the 
average  sojourn  time  in  the  system  per  customer,  in  steady-state,  at  parameter  level  9. 
The  objective  function  (to  be  minimized)  is  defined  by 

a(^)  =t»(^)-|-C,/tf.  (13) 

The  optimal  value  9’  can  be  computed  quite  easily  in  this  case.  Indeed,  w{9)  =  9/{l  —  9) 
and  9'  =  -I-  y/C^)  (if  this  value  is  not  in  [a,b],  the  optimum  is  at  the  nearest 

boundary).  But  we  can  ignore  momentarily  these  results,  and  try  to  solve  the  problem 
using  the  stochastic  algorithm  described  in  section  2,  combined  with  different  variants  of 
the  gradient  estimation  techniques  described  in  the  previous  section.  The  solutions  can 
then  be  compared  to  the  true  optimal  solution  for  an  empirical  evaluation.  Henceforth,  we 
will  refer  to  these  variants  as  different  algorithms. 

We  actually  performed  such  an  experiment  u  follows.  For  each  algorithm,  we  made  N 
simulation  runs,  each  yielding  an  estimation  of  9".  The  N  initial  parameter  values  were 
randomly  chosen,  uniformly  in  [a,  b],  and  the  initial  state  (so)  was  an  empty  system.  Across 
the  algorithms,  we  used  common  random  numbers  and  the  same  set  of  initial  parameter 
values.  Each  run  was  stopped  after  a  (fixed)  total  of  T  ends  of  service,  but  that  stopping 
criterion  was  checked  only  between  iterations.  Hence,  all  algorithms  had  about  the  same 
chance  and.  if  we  neglect  the  differences  in  overhead  for  the  gradient  estimation  techniques, 
used  about  the  same  CPU  time.  (The  overhead  was  quite  low  in  general,  except  for  very 
small  values  of  Tn,  like  r„  =  1.)  The  programs  were  written  using  SIMOD  [12],  a  simulation 
package  based  on  the  language  Modula-2.  Each  customer  had  a  record  associated  with  it, 
to  keep  track  of  its  arrival  time.  These  records  were  created  dynamically  and  the  waiting 
customers  wer^  .'ut  explicitly  into  a  linked  queue.  For  this  example,  some  methods,  like 
PA  for  insta.'.  c-  lo  not  require  keeping  track  of  the  individual  arrival  times,  and  in  that 
case  we  could  hav'*  saved  a  fair  amount  of  CPU  time  by  keeping  only  a  counter  instead 
of  a  linked  oueue.  dowever,  we  insisted  on  using  exactly  the  same  simulation  program  for 
all  algorithms.  In  fact,  the  simulation  model  and  the  algorithms  were  implemented  in  two 
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different  modules,  the  latter  being  totally  model  independent.  We  also  avoided  using  any 
variance  reduction  technique. 

For  each  algorithm  variant,  we  computed  the  empirical  mean  $,  standard  deviation 
Sd  and  standau’d  error  ««  of  the  N  retained  parameter  values,  and  also  traced  graphics  of 
the  evolution  of  these  quantities  sis  functions  of  the  total  number  of  ends  of  service.  If 
yi  denotes  the  retained  parameter  value  for  run  t  (i.e.  the  last  value  of  ffn),  the  above 
quantities  are  defined  by 


^  E  y.;  =  ^  ~ 


We  also  computed  a  confidence  interval  le  on  the  expectation  of  Sy  assunnng  that  \/N — 
E{S))/sd  follows  a  Student  distribution  with  JV  —  1  degrees  of  freedom. 


4.2.  Implementing  the  gradient  estimation  methods 

Here  are  some  specific  implementation  details  for  the  different  gradient  estimation  tech¬ 
niques.  In  this  example,  Tn  represents  the  number  of  ends  of  service  for  each  subrun  of 
iteration  n,  except  for  the  regenerative  methods,  where  it  represents  the  number  of  re¬ 
generative  cycles  per  subrun.  Hence  each  iteration  ended  at  a  time  when  a  customer  was 
leaving  the  system.  Of  course,  if  the  queue  was  not  empty  at  that  point,  we  were  careful 
to  generate  the  new  service  time  only  at  the  beginning  of  the  next  iteration,  i.e.  afier 
the  parameter  was  modified.  For  all  the  methods  except  KWC,  the  final  state  of  every 
simulation  subrun  was  taken  as  the  initial  state  for  the  next  one.  For  the  regenerative 
approaches,  a  new  regenerative  cycle  began  at  the  end  of  each  busy  period. 

For  the  KW  method,  the  average  cost  was  estimated  for  each  subrun  j  of  iteration  n  by 
Ci/$n  +  2nj,  where  Znj  was  the  average  system  time  for  the  customers  who  leaved  during 
that  subrun. 

For  PA,  the  gradient  estimation  can  be  computed  as  follows  (see  [22,  23]).  Every  time 
a  customer  leaves  the  system,  we  look  at  the  elapsed  time  since  the  beginning  of  the  busy 
period.  Let  i/„  be  the  average  of  these  times  for  the  customers  leaving  at  iteration  n. 
The  gradient  estimate  is  then 

Y,y=Ujeyy-C,fel  (iS) 

For  LR,  we  can  view  momentarily  a(^n)  in  (11)  and  (12)  as  representing  the  expected 
average  cost  during  the  subrun  of  iteration  n  given  the  initial  state  Sn,  and  u)  as  the  T„ 
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service  times  generated  during  that  iteration.  Hence,  at  ^ 


^  ePs(_dit/) 


d  .  n  /  J  \ 

=  5jli,P,(<iu.)=5^1n 


(16) 


where  (j  is  the  service  time  of  the  j-th  customer  and  fe{C)  =  (l/^)cxp(“C/^)  is  the  service 
time  density.  After  easy  manipulations,  one  obtains 


(17) 


where  z„  is  the  average  system  time  for  the  customers  who  leaved  during  that  subrun. 


Explicit  formulas  for  the  quantities  involved  in  algorithm  A  of  [4],  which  implements  a 
regenerative  approach,  can  be  derived  in  a  similar  way.  In  that  case,  one  has 


yn  = 


QiQi  ~  Q2Q4 


where  for  k  =  1,2, 4, 5, 

i=l 

and  for  y  =  1, . , . ,  r„,  Qji  is  the  number  of  departures  during  the  ji-th  regenerative  cycle, 
Qj2  is  the  total  system  time  for  those  Qji  customers  who  left  during  that  cycle,  Qje  is 
their  total  service  time  (i.e.  the  length  of  the  busy  cycle),  Qn  =  QjiiQje  —  and 

Q]5  =  Q]2Qj4lQj\- 

We  also  implemented  the  regenerative  algorithms  described  in  [3]  (with  and  without 
the  airctan  transformation),  SAMOPT  [1],  and  other  variants,  for  which  we  will  not  give 
the  details  here.  Algorithm  B  in  [4]  is  not  applicable  for  this  example,  because  our  cost 
function  is  not  exactly  of  the  form  given  in  equation  (4.2)  of  [4]  (the  residence  time  of  a 
customer  depends  not  only  on  its  own  service  time,  but  also  on  previous  ones).  Kesten  [7] 
has  proposed  an  heiiristic  rule  under  which  instead  of  diminishing  7„  at  each  iteration,  one 
diminishes  it  only  when  the  sign  of  the  gradient  estimate  (for  one  parameter)  is  different 
from  the  one  of  the  previous  iteration  (i.e.  when  the  change  on  the  parameter  changes 
direction).  The  heuristic  idea  is  that  if  the  parameter  keeps  moving  in  the  same  direction, 
it  should  be  because  we  are  still  far  away  from  the  optimum  and  so,  lets  give  it  a  chance  to 
move  faster.  That  heuristic  might  help  in  situations  where  we  start  really  far  away  from 
the  optimum,  and  where  the  change  on  the  parameter  at  each  iteration  tends  to  be  very 
small.  We  have  implemented  this  rule  for  some  of  our  experiments. 
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4.3.  Numerical  results 


Figure  1  is  a  graphical  representation  of  some  of  the  results  of  an  experiment  we  made 
with  T  =  10®,  N  =  10,  [a,fc]  =  [0.01,0.95]  and  Ci  =  1.  The  optimal  solution  is  =  0.5. 
The  description  of  the  algorithm  variants  appear  on  the  left.  For  each  variant,  the  white 
bar  represents  the  standau'd  deviation  »d  <^4  dark  bar  represents  the  standard  error 
Sg.  Numerical  values  are  given  in  the  first  column  of  table  1  (which  appears  further  down). 
We  computed  the  95%  confidence  intervals  1$  as  described  in  section  4.1,  and  the  entries 
for  which  1$  does  not  contain  9*  are  indicated  in  table  1.  KW,  KWC,  PA  and  LR  have  the 
same  meaning  as  in  section  3.  LRR  refers  to  algorithm  A  in  [4].  PAR  means  perturbation 
analysis  with  a  regenerative  approach:  each  iteration  is  comprised  of  T„  regenerative  cycles 
instead  of  T„  ends  of  service.  The  symbol  (K)  following  the  name  of  the  algorithm  signifies 
that  Kesten’s  rule  was  used.  The  symbol  (S)  following  KW  means  that  instead  of  always 
simulating  first  at  Bn  —  Cn  and  then  at  Bn  +  Cn,  we  adopted  the  following  heuristic  rule  for 
KW:  if  the  parameter  has  just  decreased,  simulate  first  on  the  right  (at  +  c„),  otherwise 
simxilate  first  on  the  left.  The  rationale  is  that  if  the  parameter  has  just  decreased,  the 
current  state  has  been  reached  by  simtdating  at  a  parameter  value  larger  than  d„,  and 
should  thus  be  a  statistically  ‘‘better”  initial  state  for  a  simulation  at  +  Cn  than  at 

—  Cn  (and  symmetrically  if  the  parameter  has  just  increased).  In  aU  cases,  we  had 
7n  =  l/n  and  for  KW  and  KWC,  we  took  c„  =  (Intuitively,  Cn  =  would 

have  given  much  too  wide  finite  difference  intervals,  since  even  after  1000  iterations,  we 
would  still  have  2c„  =  0.2,  but  we  also  tried  it  and,  surprisingly,  the  results  were  about  as 
good). 

We  can  see  that  PA  performs  very  well,  even  when  T„  is  fixed  at  a  small  constant. 
Surprisingly,  KWC  with  a  linearly  increasing  T„  is  almost  as  good.  When  T„  is  fixed  to  a 
small  constant,  KWC  also  converges  rather  quickly  (small  s,^),  but  the  standard  error  is 
very  large,  which  means  that  it  does  not  converge  to  the  right  place!  Even  for  T„  =  100,  the 
bias  is  still  quite  apparent  and  Ig  does  not  contain  B*.  KW  has  about  the  same  behavior, 
but  with  larger  variance.  KW(S)  is  slightly  better  than  KW,  but  not  competitive  with 
KWC  or  PA.  PAR  also  has  a  bias  problem  when  r„  is  fixed.  The  problem  is  that  with  the 
regenerative  approach,  the  number  of  ends  of  service  during  the  Tn  regenerative  cycles  is 
now  random,  and  we  get  a  bias  due  to  the  fact  that  we  estimate  a  ratio  with  that  number 
on  the  denominator.  Of  course,  this  bias  goes  to  zero  as  Tn  goes  to  infinity,  and  this  is 
why  PAR  with  Tn  =  n  works  fine. 

The  LR  methods  in  general  have  some  trouble  due  to  their  large  associated  variance. 
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Figure  1:  Performance  of  various  methods  for  Ci  =  1,  T  =  10^  and  N  =  10. 


White  bar  is  Sd  and  black  bar  is  s^. 
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That  varizince  stays  low  when  T„  grows  slowly,  but  then,  the  bias  becomes  more  of  a 
problem.  LR  with  T„  =  n’’  has  large  Tariance  for  large  p,  and  for  small  p,  the  bias  goes 
down  much  too  slowly  compared  to  the  variance.  The  result  is  that  the  confidence  interval 
Isi  based  on  the  N  final  values  of  0„,  is  very  likely  not  to  cover  6^.  This  is  what  happens, 
for  instance,  with  p  =  1  /3.  The  only  LR  variant  that  gives  reasonably  good  results  here  is 
LRR,  based  on  a  regenerative  approach,  with  T„  increasing  linearly.  With  T„  =  both 
LRR  and  KWC  have  the  same  bias  problem  as  described  above:  the  bias  goes  down  too 
slowly  and  1$  does  not  contain  0*.  Nevertheless,  they  converge  (slowly)  to  the  right  answer 
(we  verified  it  empirically  with  longer  simulation  nms).  Kesten’s  rule  doesn’t  appear  to 
help  for  any  of  the  methods.  SAMOPT  [1]  and  the  algorithms  described  in  [3]  gave  rather 
bad  restilts  (huge  variances)  and  they  do  not  appear  in  the  figure.  They  are  obviously 
not  competitive,  at  least  for  this  example.  The  problem  with  SAMOPT  is  that  near  the 
optimum,  the  gradient  is  very  small  in  absolute  value,  and  replacing  it  by  its  sign  is  really 
not  a  good  idea.  We  also  obtained  bad  results  with  other  variants,  like  for  instance  PA 
with  Tn  =  100  -f-  n  but  7„  =  l/>/n  instead  of  1/n.  Other  sets  of  experiments  were  also 
done  with  T  =  10^  and  the  results  were  quite  similar  to  the  ones  given  here. 

Figures  2  to  6  illustrate  the  evolution  of  the  parameter  as  a  function  of  the  simulation 
length,  for  different  algorithms.  Each  curve  is  the  average  of  the  N  =  10  curves  associated 
with  the  individual  runs.  On  the  horizontal  axis,  one  has  the  total  number  of  ends  of 
service  to  date  in  the  simulation  run,  and  on  the  vertical  axis,  the  average  of  the  N  values 
of  0n  observed  at  that  point  for  the  N  runs.  Each  curve  was  actually  drawn  by  linear 
interpolation,  using  101  data  points  equally  spaced  on  the  horizontal  axis  (every  10^  ends 
of  service). 

Figures  2-3  show  how  KWC  converges  much  better  than  KW,  and  how  KWC  with 
T„  =  5  converges  to  somewhere  around  0.65. 

In  figures  4-5,  we  see  the  convergence  of  PA  for  T„  =  1  and  Tn  =  10.  It  is  interesting 
to  observe  here  that  the  evolution  of  the  parameter  with  these  two  values  of  T„  is  almost 
identical.  We  also  made  other  experiments  with  different  (constant)  values  of  T„  between  1 
and  1000,  and  observed  that  the  evolution  was  almost  identical  in  all  cases,  independently 
of  T„.  Note  that  all  these  simulations  were  done  with  common  random  ntimbers  and  same 
sets  of  starting  parameter  values.  Moreover,  even  if  the  starting  values  are  different,  the 
evolution  is  again  almost  identical  if  common  random  numbers  are  used. 
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Figure  2:  Convergence  of  KW  (in  white)  and  KWC  (in  black)  with  Tn  =  n. 
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Figure  3:  Convergence  of  KWC  with  Tn  =  5. 
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Figure  6:  Convergence  of  LRR  with  r„  =  n. 
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Figure  7:  Evolution  of  Se  (N=100)  for  KWC  (black),  PA  (white)  and  LRR  (star). 
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.15255 

KWC,  r„  =  100 

KWC,  =  n 

.02860 

.02773 

KWC,  Tn  =  100  +  n 

.00189 

.00212 

KWC(K),  r„  =  n 

.00216 

.00242 

KWC(K),  r„  =  100  +  n 

.00189 

.00212 

KWC,  Tn  = 

.00261* 

.00731 

PA,  r„  =  1 

.00227 

.00217 

PA,  r„  =  10 

.00227 

.00216 

.00053 

.00051 

.02402 

.02575 

PA,  Tn  =  100 

.00229 

.00219 

PA,  Tn  =  100  +  n 

.00203 

.00192 

.00045 

.00043 

.02685 

.02848 

PA(K),  Tn  =  100  +  n 

.00203 

.00192 

PAR,  r„  =  5 

.00228* 

.06175 

PAR,  Tn  =  n 

.00200 

.00197 

.00046 

.00044 

.02981 

.03110 

LR,  Tn  = 

.01221* 

.02062 

LR,  Tn  = 

.03012 

.02876 

.02454 

.02355 

.04473 

.05214 

LR,  r„  =  10  +  ni/2 

.03300 

.03165 

LR,  Tn  = 

.07494 

.07115 

LRR,  Tn  —  n 

.00447 

.00453 

.00124 

.00118 

.07608 

.07446 

LRR,  Tn  —  10  -|-  n 

.00478 

.00483 

LRR,  Tn  =  ni/2 

.00443* 

.01775 

Table  1:  Some  experimental  results  for  T  =  10®,  iV  =  10  and  Ci  =  1, 1/25,25. 
For  the  values  marked  with  an  asterisk,  the  computed 
95%  confidence  interval  does  not  contain 


20 


Figure  6  shows  the  evolution  for  LRR  with  Tn  =  n.  The  variation  is  larger  here  than 
for  KWC  and  PA. 

In  figure  7,  we  can  see  the  evolution  of  the  standard  error  Sg  with  the  simulation  length, 
for  three  of  the  most  interesting  methods.  These  graphs  are  the  resiilt  of  a  new  experiment, 
with  longer  and  much  more  numerous  rtms:  we  took  T  =  2  x  10®  and  N  =  100.  They  were 
also  made  by  linear  interpolation  using  101  points.  Again,  we  see  that  KWC  and  PA  are 
roughly  comparable,  and  that  LRR  has  about  twice  their  standard  error  (four  times  more 
variance).  From  these  graphs,  we  can  get  an  idea  of  the  convergence  rates:  the  standard 
error  gets  approximately  cut  in  half  when  the  simulation  length  is  multiplied  by  four  (i.e. 
the  variance  is  approximately  inversely  proportional  to  the  simulation  length). 

Simple  (heuristic  and  approximate)  statistical  tests  can  also  be  made  on  the  convergence 
rates.  We  can  perform  for  instance  another  200  independent  runs  of  length  T  =  500000, 
compute  the  associated  variance  s^,  and  compare  it  to  the  variance  obtained  in  the  above 
experiment,  using  the  F  statistic.  Under  the  nidi  hypothesis  Hq:  “the  convergence  rate  is 
/~i/2»  (inhere  i  denotes  the  simulation  length),  the  variance  for  T  should  be  approximately 
four  times  the  variance  for  T,  and  the  statistic  F  =  ^a\la\  should  follow  (approximately) 
a  F  distribution  with  (99,199)  degrees  of  freedom.  We  performed  these  tests  for  KWC 
with  Tn  =  n,  PA  with  Tn  =  10  and  LRR  with  Tn  =  n,  and  obtained  the  F  values  of 
1.166,  0.0826  and  1.045  respectively.  Since  we  know  that  the  convergence  rate  cannot  be 
faster  than  we  can  test  the  nviU  hypothesis  against  the  alternative  that  it  is  for 

z  <  1/2.  In  that  case,  Ha  is  rejected  at  a  95%  level  when  F  >  1.32,  which  is  the  case  for 
none  of  the  three  algorithms  here.  It  should  be  pointed  out,  however,  that  despite  the  huge 
amount  of  simulation  time  they  require,  these  tests  are  not  very  powerful.  If  we  suppose 
for  instance  that  the  true  convergence  rate  is  the  probability  of  getting  F  >  1.32  is 
actually  slightly  smaller  than  0.5.  Also,  the  convergence  rate  is  an  asymptotic  expression, 
and  since  we  use  finite  values  of  T  and  T,  we  are  only  testing  some  approximation  of  it. 

How  does  the  speed  of  convergence  of  to  9"  compare  to  the  speed  of  convergence 
of  the  cost  estimator  to  the  true  average  cost  when  0  is  fixed  ?  We  note  that  simply 
comparing  the  widths  of  the  confidence  intervals  at  the  end  doesn’t  make  sense,  since  the 
parameter  and  the  cost  are  not  necessarily  measured  on  the  same  scale.  Dividing  by  the 
means  to  obtain  relative  values  doesn’t  make  sense  either,  there  might  be  cases  where  9‘ 
or  the  average  cost  is  zero  or  near  zero.  In  any  case,  it  is  well  known  that  the  average 
cost  estimator  converges  at  rate  and  we  have  observed  the  same  convergence  rate 

for  This  means  that  we  can  estimate  the  optimum  about  as  fast  (in  terms  of  rates). 
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as  we  can  estimate  the  cost  at  a  given  point!  We  tried  the  following  numerical  example: 
with  9  fixed  at  0.5,  we  performed  100  independent  simulation  runs  of  length  T  =  2  x  10^ 
to  estimate  the  average  cost.  The  sample  mean,  standard  deviation  and  mean  square  error 
of  these  100  values  were  3.00001,  0.00214  and  0.00213  respectively.  Figure  8  illustrates  the 
evolution  of  the  standard  error  of  the  100  values  in  terms  of  the  simulation  length. 

0.010 

0.008 

0.006 

0.004 

0.002 

0.000 

0  500000  1000000  1500000  2000000 

Simulation  length 

Figure  8:  Evolution  of  the  standard  error  for  the  average  cost  estimation 
with  N  =  100,  T  =  2  X  10®,  9  =  0.5  (actual  cost:  3.0). 

We  made  other  sets  of  experiments  with  Ci  =  1/25  (for  which  9’  =  1/6)  and  C\  =  25 
(for  which  9*  =  5/6).  The  results  appear  in  table  1.  For  Ci  =  1/25,  the  traffic  intensity  for 
9  near  9‘  is  low,  and  we  get  a  much  lower  variance  than  for  Ci  =  1.  The  opposite  is  true 
for  Cl  =■  25.  The  relative  "rankings’’  of  the  algorithms  are  about  the  same.  However,  for 
KWC  and  Ci  =  1/25,  the  variance  for  9„  goes  down  quickly  and  the  bias  does  not  go  to 
zero  fast  enough  to  cope  with  that.  The  result  is  that  for  this  experiment,  the  confidence 
interval  1$  does  not  contain  9".  A  possible  remedy  is  to  increase  Tn  faster.  But  in  any 
case,  this  shows  that  one  must  be  very  careful  about  confidence  intervals  in  these  kinds  of 
experiments,  even  if  they  are  asymptotically  valid.  For  Cj  =  25,  LR  is  now  better  than 
LRR  {9  is  larger  and  the  regenerative  cycles  are  much  longer  in  this  case). 
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4.4.  Proving  some  specific  convergence  properties 

We  will  now  look  more  closely  at  why  for  our  example,  for  r„  constant,  PA  works  fine 
while  KW  and  KWC  don’t.  To  keep  things  simple,  let’s  examine  KWC  with  r„  =  1. 

Lemma  1.  With  Ci  =  1,  0  =  [a,  6]  where  0  <  a  <  6  <  1,  Tn  =  1,  and  a  sequence 
>  0}  satisfying  (9),  KWC  converges  to  b  almost  surely. 


PROOF.  When  we  estimate  the  average  cost  using  r„  =  1,  we  actually  look  at  the 
time  spent  in  the  system  by  one  customer,  i.e.  the  customer  being  served  in  that  subrun. 
This  time  can  be  expressed  as  where  is  the  inverse  of  the  service  time 

distribution  (we  assume  that  the  inversion  method  is  used  to  generate  the  service  times), 
tt  is  a  f7(0, 1)  variate,  and  is  the  (waiting)  time  already  spent  in  the  system  by  that 
customer.  is  independent  of  what  happens  during  that  subrun,  and  =  0  if  the 
subrun  began  with  an  empty  system.  Hence,  the  average  cost  estimate  at  parameter  level 


6  can  be  written  as 


h{0,u)==K^^Fr{u)  +  -. 


and  for  KWC,  we  obtadn 

_  -  HK.u)  _  +  I/K  -  UK 

*  n  ^ 


0+  -  0- 
''n 


where  0~  =  -  c„)  =  max(a,^n  -  Cn)  +  c„)  =  min{b,0„  +  c„).  Note 

that  0n  ~  ^  For  the  exponential  case,  we  have  E[Ff^{u)  \0]  =  0  and  thus 


£^n[y;]  = 


=  1  + 


which  converges  to  1  —  1/^J  as  c„  goes  down  to  zero.  Since  0n  <  6  <  1,  that  expectation 
converges  to  a  negative  value  everywhere  on  0.  This  means  intuitively  that  0n  will  be 
"attracted”  towards  the  upper  bound  b  and  we  can  prove  it  using  Theorem  2.  Let  us 
redefine  for  the  moment  a(')  such  that 

Va(^)  =  =  1-4- 

^  ’  d$  0^ 

In  that  case,  in  equation  (2)  we  have  0n  =  1/^n  ”  — »  0  as  n  — »  oo,  and  < 

Ktl{0*  —  )*  <  Ftfc^  for  some  constant  K^.  Hence,  Si,  S2  and  S4  are  satisfied,  with 

=  c^fKt  in  S4.  Since  ^  =  6  is  in  this  case  the  unique  asymptotically  stable  point  of 
(26),  S5  is  satisfied  with  A  =  0,  and  Theorem  2  applies.  | 

For  PA  with  constant  T„,  we  don’t  have  ►  0  as  n  — »  oo.  Instead  of  using  Theorem 
2,  we  will  use  Theorem  4  and  prove  a  weaker  convergence  result. 
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Lemma  2.  The  algorithm  of  tection  i  with  PA,  {7n,«  >  0}  tatitfying  WS  of  the  appendix, 
and  conttant  Tn  converges  in  probability  to  the  optimum  9, . 

PROOF.  We  need  to  verify  assumptions  Wl  to  W7  and  the  result  will  follow  from 
Theorem  4.  Note  that  most  complications  arise  because  of  the  non-compactness  of  the 
state  space  (the  queue  length  is  unbounded).  Let  T„  =  J,  a  constant.  When  using  PA  in 
this  case,  the  only  essential  things  that  we  must  know  at  the  beginning  of  each  iteration 
are  the  number  of  customers  in  the  system,  and  the  elapsed  time  since  the  beginning  of 
the  current  busy  period  (=  0  if  system  is  empty).  Let  and  m„  be  the  values  of  these 
two  quantities  at  the  be^nning  of  iteration  n.  To  verify  Wl,  we  also  put  in  s„4.i  the 
values  Unj,j  =  1,...,  J,  where  is  the  elapsed  time  since  the  beginning  of  the  current 
busy  period  when  the  j-th  departure  during  iteration  n  occurs.  Note  that  as  defined  in 
section  4.2  is  the  average  of  these  i/„j’s.  For  n  >  0,  let 

=  (lr,,m„,i/„i,...,i/„j)  €  5  =  [0,oo)  X  {0,1,2,...}  x  [0,00)-^, 

and  let  so  =  (0i-'*i0)  €  5.  {(d„, j„+i),n  >  0}  is  obviously  a  Markov  process,  and  the 
probability  law  of  a„+i  given  (tf„,s„)  is  independent  of  n  and  weakly  continuous  in  j„). 

For  any  fixed  9  €  [a, f>],  the  system  is  stable  and  {s„,n  >  0}  is  a  Markov  process  with 
steady-state  distribution  P®(-)*  Since  the  system  is  stable,  for  any  c  >  0,  there  is  a  constant 
Kg,(  >  0  large  enough  so  that  P®([0,  A‘s,j]'^'''^)  >  1  —  e.  Note  that  for  any  n  >  0,  s„  when  9 
is  fixed  is  stochastically  increasing  in  9.  Also,  s„  is  stochastically  smaller  when  9  is  allowed 
to  move  in  [a,  6]  than  when  it  is  fixed  at  ^  =  6.  The  tightness  properties  required  in  Wl 
and  W2  follow  from  these  stochastic  inequalities,  the  fact  that  lo  =  mo  =  0,  and  the  fact 
that  the  system  is  stable  when  9  £  [a, 6]. 

W3  holds  trivially.  For  W4,  take  K/  =  1/a*,  ^(s„+i)  =  Un  and  «  =  1.  When  9  is  fixed 
at  6,  m„  has  a  steady-state  distribution  with  bounded  second  moment  (see  [8]),  and  thus 
the  same  applies  to  i/„.  Because  of  the  stochastic  inequalities  above,  this  also  true  when 
9n  moves  in  [a,  6],  so  that  8up„>o  ^o(t'n)  <  o®- 

W5  follows  from  Wl  and  the  fact  that  Fi,  is  a  continuous  function  of  (0„,s„.).i). 

For  each  e  >  0,  let  be  defined  as  above.  For  each  S  compact  subset  of  S,  there  is 
a  ng  such  that  for  all  s„  €  5  and  i  >  ng,  P*(s„+,  €  [0, 2A’6,e]'^'*'*  |  s„)  >  1  —  c.  Here,  ng  can 
be  viewed  as  a  time  that  we  give  to  the  system  to  stabilize.  Roughly,  if  5  is  "bigger”,  the 
initial  state  could  be  "bigger”  (e.g.  large  initial  queue  size),  and  we  will  take  a  larger  n^. 
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When  6n  is  allowed  to  be  smaller  than  b,  this  makes  stochastically  smaller,  so  that 
this  inequality  still  holds,  and  this  implies  W6. 

From  Sun  and  Zazanis  [23,  Theorem  2  and  Corollary  1],  the  steady-state  expectation 
of  =  /(^„,s„+i)  when  0  is  fixed  is  equal  to  Vo[(^)  =  1/(1  —  0)^  —  C\/0'^.  The  equation 
X  =  if(— Va(d))  has  a  unique  asymptotically  stable  point  at  0‘  =  +  v/^)  if  this 

quantity  is  in  6,  and  at  the  nearest  boundary  of  6  otherwise,  so  W7  is  satisfied  and  this 
completes  the  proof.  I 

These  two  lemmas  show  that  for  this  exaonple,  PA  is  not  equivalent  to  “infinitesimal” 
KWC.  Looking  more  closely,  we  can  see  that  the  difference  lies  in  the  way  the  methods  are 
implemented.  In  PA  (see  [22,  23]),  the  perturbation  “accumulator”  (which  holds  in  this 
case  the  elapsed  length  of  the  current  busy  period)  is  not  reset  to  zero  at  the  beginning 
of  each  iteration.  Hence,  the  waiting  times  of  all  the  customers,  including  the  waiting 
that  occured  before  the  current  iteration  (if  any),  are  perturbed  appropriately.  For  KW  or 
KWC,  the  “perturbed”  parameter  (^nicn)  affects  only  the  service  times  that  are  generated 
during  its  current  use,  and  the  waiting  that  occurs  during  these  service  times.  Thus,  the 
elapsed  waiting  times  for  the  customers  that  are  already  in  the  queue  at  the  beginning  of 
a  simulation  subrun  are  influenced  only  by  the  parameter  values  for  the  previous  subruns. 
Obviously,  this  introduces  a  bias  on  the  estimated  derivative. 
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5.  Conclusion 


Through  a  simple  example,  we  have  seen  how  a  gradient  estimation  technique,  such  as 
finite  differences  with  common  random  numbers  (KWC),  infinitesimal  perturbation  anaJ- 
ysis  (PA)  or  likelihood  ratio  (LR),  can  be  incorporated  into  a  stochastic  approximation 
algorithm  to  get  a  provably  convergent  stochastic  optimization  method.  We  also  pointed 
out  some  dangers  associated  with  different  kinds  of  bias. 

For  the  example  considered,  PA  gave  the  best  results,  but  this  may  not  be  true  in 
general.  In  fact,  there  are  many  examples  for  which  some  of  the  methods  do  not  apply. 
For  instance,  consider  the  same  M/M/1  queue  sis  in  section  4,  but  replace  u;(0)  in  the 
objective  function  (13)  by  the  proportion  of  customers  that  wait  more  than  C2  units  of 
time,  where  C2  is  a  positive  constant.  Here,  PA  does  not  work,  because  for  a  fixed  finite 
segment  of  a  realization  u;,  the  derivative  with  respect  to  6  of  the  number  of  customers 
who  wait  more  than  C2  is  zero  with  probability  one.  As  another  example,  keep  the  same 
objective  function  as  in  section  4,  but  take  a  different  service  time  distribution:  assume 
that  the  service  time  is  1  with  probability  B  and  2  with  probability  1  —  d,  where  0  <  ^  <  1 
and  6  is  the  parameter.  Again,  PA  doesn’t  apply  (see  [23]).  Suppose  now  that  the  service 
time  is  1  wnith  probability  p  and  1  +  ^  with  probability  1  —  p,  where  p  is  fixed,  0  <  p  <  1, 
and  ^  >  0  is  the  parameter.  In  this  case,  LR  (as  described  in  section  3.4)  doesn’t  apply 
since  the  set  of  possible  realizations  depends  on  9.  We  don’t  rule  out  the  possibility  that 
eventual  adaptations  of  the  methods  might  work  for  these  cases,  but  to  our  knowledge,  this 
still  has  to  be  done.  For  many  practical  problems  for  which  a  threshold-type  parameter 
has  to  be  optimized,  neither  PA  nor  LR  do  apply,  at  least  in  their  current  forms.  This 
certainly  needs  further  research. 

The  performance  of  these  algorithms  when  there  are  many  parameters  to  optimize,  and 
the  incorporation  of  proper  variance  reduction  techniques,  are  other  interesting  subjects 
for  further  investigation.  We  mentioned  that  PA  can  be  viewed  as  a  special  case  of  LR 
with  u;  defined  in  a  special  way.  In  fact,  depending  on  how  we  view  u  in  LR,  we  may 
obtain  different  methods  to  explore.  In  principle,  PA  and  LR  can  be  used  to  estimate 
higher  order  derivatives,  but  the  variance  is  typically  quite  high.  Is  it  too  high  to  permit 
the  implementation  of  good  second  order  algorithms  based  on  these  estimates  ?  Again, 
further  investigation  is  needed. 
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A.  Appendix:  sufficient  convergence  conditions 


In  this  appendix,  we  give  sufficient  conditions  for  the  convergence  of  (1)  to  an  optimum. 
The  first  set  of  conditions  imply  almost  sure  convergence,  the  second  set  (usually  easier  to 
verify)  imply  weak  convergence.  These  conditions  are  adaptations  from  [9,  11].  They  are 
not  given  in  their  most  general  form,  but  are  general  enough  for  our  purposes  here. 

In  what  follows,  we  will  assume  that  0  is  a  compact  and  convex  set  of  the  form  6  = 
{$  €  R**  I  g{6)  <  0},  where  g(-)  is  a  K^-dimensional  vector  of  continuously  differentiable 
functions  (constraints),  as  in  [13,  chap.lO],  and  at  any  point  on  the  boundary  of  0,  the 
gradients  of  the  active  constraints  are  linearly  independent.  For  any  continuous  function 
r  :  0  — »  one  may  view  «(•)  as  the  gradient  of  some  objective  function.  Define  the  set 
of  Kuhn- Tucker  points  associated  with  v(-)  as 

KT{v{’))  =  €  R*^  I  3/i  €  R**,  m  ^  s'ich  ^hat  w(^)  -|-  n'Vg{6)  =  0,  yi!g{6)  =  o| . 

(22) 

Let  us  introduce  some  notation,  adapted  from  [9].  Define  in  =  HJLoT*  ’’^(0  = 
max{n  |  tn  <  0  for  <  >  0.  Let  i°(-)  be  the  piecewise  linear  interpolation  of  the  set  of 
points  {(fni^n)>«  >  0},  and  x’*(*)  the  left  shift  of  i°(*)  defined  by  *"(<)  =  *°(t  -t-  <„),  for 
<  >  0.  Hence,  *"(0)  =  Qnt  and  if  *'*(•)  converges  to  a  limit  i(-),  the  asymptotic  properties 
of  x(t)  as  t  — ♦  oo  can  provide  information  on  the  asymptotic  behavior  of  as  n  — ♦  oo. 
For  any  function  r  :  0  — >  R*^,  define 

Define  the  differential  equation 

X  =  #(— v(^)). 

The  set  of  asymptotically  stable  points  of  (24)  is  KT(v(-)).  The  theorems  below  give 
conditions  under  which  as  n  — »  oo,  z’’(’)  converges  in  some  sense  to  a  solution  of  (24)  for  a 
proper  v(*).  This  convergence  property  is  then  used  to  analyse  the  behavior  of  >  0}. 
We  now  give  a  list  of  assumptions  that  will  be  used  selectively  in  the  next  two  theorems. 

51.  For  all  n  >  0,  7„  >  7„+i  >  0,  and  En=o7n  =  oo. 

52.  lim„_oo  /?n  =  0  almost  surely. 


(23) 

(24) 
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S3.  For  each  T  >  0  and  c  >  0, 


Cm(jT+f)-l 

■Up  H  7.C. 


>  c  =0. 


(25) 


S4.  There  is  a  positive  sequence  {5„,  n  >  0}  such  that  <  1/^*  and 

Er=<,(7n/<»)=  <  OO. 


S5.  There  is  a  6  iifr(Va(-)),  an  asymptotically  stable  point  of 

i  =  ir(-Va(tf)), 


(26) 


with  domain  of  attraction  Da(6‘)  (in  the  sense  of  Liapounov),  and  almost  surely, 
infinitely  many  belong  to  some  compact  A  C  Da{9'). 


Theorem  1.  (Kushner  and  Clark),  (a)  A»$ume  Si  to  SS.  Then,  almost  surely,  z°(‘) 
is  uniformly  continuous  on  [0,  oo),  and  any  limit  z(>)  of  a  convergent  subsequence  of 
{*"(•), n  >  0}  satisfies  (i6).  (b)  If  6*  also  satisfies  S5,  then  lim„_oo^n  =  almost 
surely. 


Theorem  1  is  proved  in  [9,  Theorem  5.3.1}.  Condition  S3  is  quite  general,  but  has  low 
intuitive  appeal,  and  is  not  always  easy  to  verify.  The  theorem  below  uses  a  more  restricted 
but  more  “familiar”  condition.  It  is  a  variant  of  Proposition  F  in  section  II  of  [15]. 

Theorem  2.  Under  the  assumptions  Si,  St,  S4  and  S5,  the  algorithm  converges  almost 
surely  to  6*. 


PROOF.  It  stiffices  to  show  that  S3  holds,  and  the  result  will  follow  from  part  (b)  of  the 
previous  Theorem.  Note  that  under  S4,  the  sequence  n  >  0}  is  a  martingale. 

For  each  e  >  0,  from  Doob’s  inequality  and  from  S4,  we  have,  with  probability  one. 


sup 

isn 

^  tsn  ®  isn 


(27) 


for  some  constant  This  upper  bound  goes  to  zero  as  n  — f  oo.  Hence,  we  obtain 
condition  il2.2.4"  of  [9],  which  implies  S4.  I 


Often,  one  must  take  Tn  — »  oo  to  get  assumption  S2.  Sometimes,  S2  is  not  satisfied, 
but  Eolfin]  — ^  0  as  n  — »  oo,  amd  the  algorithm  might  also  converge  to  the  optimum.  These 
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cases  might  be  treated  using  the  following  (weaker)  resiilts.  An  example  will  be  given 
in  section  3.  Theorem  3  is  proved  in  [11],  with  slightly  more  general  assumptions,  while 
theorem  4  is  an  adaptation  of  the  second  part  of  Theorem  4.2.1  in  [9],  and  can  be  proved 
in  the  same  way  (note  that  in  the  last  paragraph  of  the  proof  of  Theorem  4.2.1  in  [9],  the 
max  should  be  replaced  by  a  min).  We  (pve  a  new  list  of  assumptions. 

Wl.  Yn  is  a  deterministic  function  of  (^n»^n+i)>  which  does  not  depend  on  n,  say 
Yn  =  /(^H)  An+i)<  Notice  that  might  include  some  of  the  random  values  gen¬ 
erated  during  deration  n,  or  functions  of  them,  even  if  it  is  useless  or  redundant 
information  for  the  future  evolution  of  the  system,  {(^ni^n))^  >  0}  is  a  (possibly 
nonhomogeneous)  Markov  process,  >  0}  is  tight  is  a  metric  space  5,  and 

P(sn+i  €  ■  I  =  4)t  defined  on  the  Borel  subsets  of  5,  does  not  depend  on 

n  and  is  weakly  continuous  in  (6,  s). 

W2.  For  each  fixed  ^  €  0,  i.e.  if  7n  =  0  and  9n  =  6  for  all  n,  {s„,n  >  0}  is  a  Markov 
process  with  a  unique  invariant  mesisure  P*(').  Also,  {P®(*),d  6  0}  is  tight. 

W3.  7n  >  0  for  all  n,  lim„-^oo  7n  =  =  oo  and  |7„+i  -  7„|  <  oo. 

W4.  There  are  constants  Kj  <  oo  and  k  >  0,  and  a  positive  valued  function  ^  :  5  — >  R, 
such  that  sup„>o  •^^o(|<^(«n)l*'^'‘]  <  oo  and  iy„|  <  Kj{l  -|-  (^(s„+i)). 

W5.  E[Yn  I  =  ^,  =  «]  is  continuous  in  (^,  s). 

W6.  Either  5  is  compact,  or  the  {s„,n  >  0}  are  mutually  independent,  or  for  each 
compact  5  C  5,  there  is  an  integer  ns  <  oo  such  that  for  each  T  >  0,  the  set  of 
probability  measures  {P((^„+j,s„+j)  €  •  ,  Sn  6  5  |  ^„  =  ^,s„  =  s),  ^  €  0,  n  > 
0,  j  >  ns,  t„+j  -  in  <T,S  compact  subset  of  5}  is  tight. 

W7.  Assume  Wl  to  W6  and  use  the  notation  defined  there.  Let  v  :  0  — ►  R**  be 
the  continuous  (see  W5)  function  defined  by  v(tf)  =  ^*[/(^,3)],  where  is  the 
expectation  that  corresponds  to  the  invariant  measure  P^  defined  in  W2.  There  is  a 
0*  €  KT[v{’)),  an  asymptotically  stable  point  of 

i  (28) 

with  domain  of  attraction  Da{0*)  =  0* 


29 


Theorems.  (Kxuhner  and  Shwartz).  Under  atsumpitons  Wl  to  W6,  {z"(>),n  >  0} 
is  tight  and  any  weak  limit  of  one  of  its  subsequences  satisfies  the  projected  differential 
equation  (28)  almost  everywhere  with  probability  one. 


Theorem  4.  Under  assumption  WT,  6n  converges  to  6*  in  probability,  i.e.  for  each  c  >  0, 
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